import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
# ## Plotly plotting support
# import plotly.plotly as py
import plotly.offline as py
py.init_notebook_mode(connected=False)
import plotly.graph_objs as go
import plotly.figure_factory as ff
import cufflinks as cf
cf.set_config_file(offline=False, world_readable=True, theme='ggplot')
In this lecture we examine the process of data cleaning and Exploratory Data Analysis (EDA). Often you will acquire or even be given a collection of data in order to conduct some analysis or answer some questions. The first step in using that data is to ensure that it is in the correct form (cleaned) and that you understand its properties and limitations (EDA). Often as you explore data through EDA you will identify additional transformations that may be required before the data is ready for analysis.
In this notebook we obtain crime data from the city of Berkeley's public records. Ultimately, our goal might be to understand policing patterns but before we get there we must first clean and understand the data.
To begin this analysis we want to get data about crimes in Berkeley. Remarkably, the city of Berkeley maintains an Open Data Portal for citizens to access data about the city. We will be examining the:
Fortunately, this data is also relatively well document with detailed descriptions of what it contains. Here are summaries of the fields in the data:


Unfortunately, data is seldom well documented and when it is you may not be able to trust the documentation. It is therefore critical that when we download the data we investigate the fields and verify that it reflects the assumptions made in the documentation.
from utils import fetch_and_cache
help(fetch_and_cache)
I can actually look at the source code in the notebook.
fetch_and_cache??
Occasionally, you will want to modify code that you have imported. To reimport those modifications you can either use the python importlib library:
from importlib import reload
reload(utils)
or use iPython magic which will intelligently import code when files change:
%load_ext autoreload
%autoreload 2
Notice that because I record how I got the data in the notebook, others can reproduce this experiment. However, it is worth noting that the data can change. We will want to pay attention to file timestamps.
calls_file = fetch_and_cache("https://data.cityofberkeley.info/api/views/k2nh-s5h5/rows.csv?accessType=DOWNLOAD",
"calls_for_service.csv")
stops_file = fetch_and_cache("https://data.cityofberkeley.info/resource/4p7k-drdw.json?$limit=100000",
"stops.json")
To demonstrated live downloading (of something we won't need in lecture) we can download professor Gonzalez's website.
gonzalez_file = fetch_and_cache("https://people.eecs.berkeley.edu/~jegonzal/assets/jegonzal.jpg",
"gonzalez.jpg",
force=True)
Now that we have obtained the data we want to understand its:
Before we even begin to load the data it often helps to understand a little about the high-level structure:
I often like to start my analysis by getting a rough estimate of the size of the data. This will help inform the tools I use and how I view the data. If it is relatively small I might use a text editor or a spreadsheet to look at the data. If it is larger, I might jump to more programmatic exploration or even used distributed computing tools.
type(stops_file)
help(stops_file)
stops_file.stat()
calls_file.stat().st_size
All the files are relatively small and we could comfortable examine them in a text editors. (Personally, I like sublime or emacs but others may have a different view.).
In listing the files I noticed that the names suggest that they are all text file formats:
We will dive into the formats in a moment. However because these are text data I might also want to investigate the number of lines which often correspond to records.
from utils import line_count
help(line_count)
print("Lines in the calls file", line_count(calls_file))
print("Lines in the stops file", line_count(stops_file))
If your iPython session is running on a POSIX compliant machine (e.g., mac/linux) you could also run:
!wc data/*
which would produce (lines, words, characters):
16053 85512 976004 data/calls_for_service.csv
148 1541 54988 data/gonzalez.jpg
29850 658525 6123225 data/stops.json
46051 745578 7154217 total
# !wc data/*
If we wanted to learn more about the command we could use the man (short for manual) command to read the manual page for wc (word count).
Meta-data is data about the data. Most filesystems will maintain information about when a file was created, read, or modified. Below we use the Python time library to print the time in the local representation. Note, time is often measured in Unix Time (time in second since January 1st, 1970).
import time
print("Time of creation (seconds since Epoch):",
stops_file.stat().st_ctime)
print("Local time of creation:",
time.ctime(stops_file.stat().st_ctime))
We already noticed that the files end in csv and json which suggests that these are comma separated and javascript object files respectively. However, we can't always rely on the naming as this is only a convention. For example, here we picked the name of the file when downloading based on some hints in the URL.
Often files will have incorrect extensions or no extension at all.
Let's assume that these are text files (and do not contain binary encoded data) so we can print a "few lines" to get a better understanding of the file.
from utils import head
help(head)
head(calls_file)
Note: Because the gonzalez_file is a JPEG (image) it is not a character file format and looking at the head of the file isn't going to work:
head(stops_file)
# head(gonzalez_file)
Calls data?¶\n characters:Block_Location column:
"2500 LE CONTE AVE
Berkeley, CA
(37.876965, -122.260544)"
these are going to be difficult. What are the implications on our earlier line count calculations?head(calls_file)
Stops data?¶This appears to be a fairly standard JSON file. We notice that the file appears to contain a description of itself in a field called "meta" (which is presumably short for meta-data). We will come back to this meta data in a moment but first let's quickly discuss the JSON file format.
head(stops_file, lines=20)
JSON (JavaScript Object Notation) is a common format for exchanging complex structured and semi-structured data.
{
"field1": "value1",
"field2": ["list", "of", "values"],
"myfield3": {"is_recursive": true, "a null value": null}
}
A few key points:
d = {
"field1": "value1",
"field2": ["list", "of", "values"],
"myfield3": {"is_recursive": True, "a null value": None}
}
print(d['myfield3'])
We will now attempt to load the data into python. We will be using the Pandas dataframe library for basic tabular data analysis. Fortunately, the Pandas library has some relatively sophisticated functions for loading data.
Because the file appears to be a relatively well formatted CSV we will attempt to load it directly and allow the Pandas Library to deduce column headers. (Always check that first row and column look correct after loading.)
calls = pd.read_csv(calls_file)
calls.head()
How many records did we get?
calls.size
EVENTDT -- Contain the incorrect time stampEVENTTM -- Contains the time in 24 hour format (What timezone?)CVDOW -- Appears to be some encoding of the day of the week (see data documentation).InDbDate -- Appears to be correctly formatted and appears pretty consistent in time.Block_Location -- Errr, what a mess! newline characters, and Geocoordinates all merged!! Fortunately, this field was "quoted" otherwise we would have had trouble parsing the file. (why?)BLKADDR -- This appears to be the address in Block Location.City and State seem redundant given this is supposed to be the city of Berkeley dataset.We notice that there are city and state columns. Since this is supposed to be data for the city of Berkeley these columns appear to be redundant. Let's quickly compute the number of occurences of unique values for these two columns.
calls.groupby(["City", "State"]).count()
According to the documentation CVDOW=0 is Sunday, CVDOW=1 is Monday, ..., Therefore we can make a series to decode the day of the week for each record and join that series with the calls data.
dow = pd.Series(["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"], name="DoW")
dow
calls['Day'] = dow[calls['CVDOW']].reset_index(drop=True)
calls.head()
We can also accomplish the same task using a join.
df_dow = pd.DataFrame(dow)
pd.merge(calls, df_dow, left_on='CVDOW', right_index=True).head()
The block location contains the GPS coordinates and I would like to use these to analyze the location of each request. Let's try to extract the GPS coordinates using regular expressions (we will cover regular expressions in future lectures):
calls['Block_Location'][0]
calls_lat_lon = (
# Remove newlines
calls['Block_Location'].str.replace("\n", "\t")
# Extract Lat and Lon using regular expression
.str.extract(".*\((?P<Lat>\d*\.\d*)\, (?P<Lon>-?\d*.\d*)\)",expand=True)
)
calls_lat_lon.head(10)
Double check that all of the records have a latitude and longitude:
calls_lat_lon.isnull().sum()
The following block of code joins the extracted Latitude and Longitude fields with the calls data. Notice that we actually drop these fields before joining. This is to enable repeated invocation of this cell even after the join has been completed.
# Remove Lat and Lon if they already existed before (reproducible)
calls.drop(["Lat", "Lon"], axis=1, inplace=True, errors="ignore")
# Join in the the latitude and longitude data
calls = calls.merge(calls_lat_lon, left_index=True, right_index=True)
# calls[["Lat", "Lon"]] = calls_lat_lon
# calls.join(calls_lat_lon)
calls.head()
We can now look at a few of the records that were missing latitude and longitude entries:
calls[calls['Lat'].isnull()].head()
Python has relatively good support for JSON data since it closely matches the internal python object model. In the following cell we import the entire JSON datafile into a python dictionary.
import json
with open("data/stops.json", "rb") as f:
stops_json = json.load(f)
The stops_json variable is now a dictionary encoding the data in the file:
type(stops_json)
We can list the keys to determine what data is stored in the object.
stops_json.keys()
The JSON dictionary contains a meta key which likely refers to meta data (data about the data). Meta data often maintained with the data and can be a good source of additional information.
We can investigate the meta data further by examining the keys associated with the metadata.
stops_json['meta'].keys()
The meta key contains another dictionary called view. This likely refers to meta-data about a particular "view" of some underlying database. We will learn more about views as we study SQL later in the class.
stops_json['meta']['view'].keys()
Notice that this a nested/recursive data structure. As we dig deeper we reveal more and more keys and the corresponding data:
meta
|-> data
| ... (haven't explored yet)
|-> view
| -> id
| -> name
| -> attribution
...
There is a key called description in the view sub dictionary. This likely contains a description of the data:
print(stops_json['meta']['view']['description'])
Another potentially useful key in the meta data dictionary is the columns. This returns a list:
type(stops_json['meta']['view']['columns'])
We can brows the list using python:
for c in stops_json['meta']['view']['columns']:
print(c["name"], "----------->")
if "description" in c:
print(c["description"])
print("======================================\n")
We can look at a few entires in the data field
stops_json['data'][0:2]
In the following block of code we:
# Load the data from JSON and assign column titles
stops = pd.DataFrame(
stops_json['data'],
columns=[c['name'] for c in stops_json['meta']['view']['columns']])
stops.head()
What do we observe so far?
We observe:
Incident Number appears to have the year encoded in it. Call Date/Time Field looks to be formatted in YYYY-MM-DDTHH:MM:SS. I am guessing T means "Time".Incident Type has some weird coding. This is define in the column meta data:
This is the occurred incident type created in the CAD program. A code signifies a traffic stop (T), suspicious vehicle stop (1196), pedestrian stop (1194) and bicycle stop (1194B).Recall the description:

Now that we have loaded our various data files. Let's try to understand a bit more about the data by examining properties of individual fields.
calls.head()
We notice that there are city and state columns. Since this is supposed to be data for the city of Berkeley these columns appear to be redundant. Let's quickly compute the number of occurences of unique values for these two columns.
calls.groupby(["City", "State"]).count()
Case numbers are probably used internally to track individual cases and my reference other data we don't have access to. However, it is possible that multiple calls could be associated with the same case. Let's see if the case numbers are all unique.
print("There are", len(calls['CASENO'].unique()), "unique case numbers.")
print("There are", len(calls), "calls in the table.")
Are case numbers assigned consecutively.
calls['CASENO'].sort_values().reset_index(drop=True).plot()
plt.xlabel("Ordered Location in Data")
plt.ylabel("Case Number")
I like to use interactive plotting tools so I can hover the mouse over the plot and read the values. The cufflinks library adds plotly support to Pandas.
calls['CASENO'].sort_values().reset_index(drop=True).iplot(
layout = dict(yaxis=dict(title="Case Number", hoverformat="d"),
xaxis=dict(title="Ordered Location in Data")))
Examining the distribution of case numbers shows a similar pattern
calls['CASENO'].iplot(kind="hist",bins=100)
One possible explanation is that case numbers were assigned consecutively and then sampled uniformly at different rates for two different periods. We will be able to understand this better by looking at the dates on the cases.
Given the weird behavior with the case numbers let's dig into the date in which events were recorded. Notice in this data we have several pieces of date/time information (this is not uncommon):
EVENTDT: This contains the date the event took place. While it has time information the time appears to be 00:00:00. EVENTTM: This contains the time at which the event took place.InDbDate: This appears to be the date at which the data was entered in the database. calls.head(3)
When Pandas loads more complex fields like dates it will often load them as strings:
calls["EVENTDT"][0]
We will want to convert these to dates. Pandas has a fairly sophisticated function pd.to_datetime which is capable of guessing reasonable conversions of dates to date objects.
dates = pd.to_datetime(calls["EVENTDT"])
dates[0]
We can verify that the translations worked by looking at a few dates:
pd.DataFrame(dict(transformed=dates, original=calls["EVENTDT"])).head()
We can also extract the time field:
times = pd.to_datetime(calls["EVENTTM"]).dt.time
times.head()
To combine the correct date and correct time field we use the built-in python datetime combine function.
from datetime import datetime
timestamps = pd.to_datetime(pd.concat([dates, times], axis=1).apply(
lambda r: datetime.combine(r['EVENTDT'], r['EVENTTM']), axis=1))
timestamps.head()
We now updated calls to contain this additional informations:
calls['timestamp'] = timestamps
calls.head()
calls['timestamp'].min()
calls['timestamp'].max()
(
calls
.sort_values('timestamp')
.iplot(
mode='markers', size=5,
x='timestamp', y='CASENO',
layout = dict(yaxis=dict(title="Case Number", hoverformat="d"),
xaxis=dict(title="Date of Call")))
)
Perhaps there are multiple different record books with different numbering schemes? This might be something worth investigating further.
In the above analysis of case numbers all we can really conclude is that we don't understand how this variable was created. The patterns in the values suggest that there may be some sampling and it is possibly worth contacting the data provider to learn more.
Do more calls occur on a particular day of the week?
dow = ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"]
calls.groupby('Day')['CASENO'].count()[dow].iplot(kind='bar', yTitle="Count")
ts = calls['timestamp']
minute_of_day = ts.dt.hour * 60 + ts.dt.minute
hour_of_day = minute_of_day / 60.
calls['minute_of_day'] = minute_of_day
calls['hour_of_day'] = hour_of_day
py.iplot(ff.create_distplot([hour_of_day],group_labels=["Hour"],bin_size=1))
In the above plot we see the standard pattern of limited activity early in the morning around here 6:00AM.
In the above plot we see a smoothed curve approximating the histogram. This is an example of a kernel density estimator (KDE). The KDE, like the histogram, has a parameter that determines it's smoothness. Many packages (Plotly and Seaborn) use a boostrap like procedure to choose the best value.
To understand how this parameter works we will use seaborn which let's you also set the parameter manually.
sns.distplot(hour_of_day, rug=False)
Notice in the above plot that the interpolation tries to follow the histogram to closely.
To better understand the time of day a report occurs we could stratify the analysis by the day of the week. To do this we will use box plots.
## Using Pandas built in box plot
# calls['hour_of_day'] = minute_in_day
# calls.boxplot('hour_of_day', by='Day')
groups = calls.groupby("Day")
boxes = {i: go.Box(y=df["hour_of_day"], name=i) for (i, df) in groups}
py.iplot([boxes[d] for d in dow])
py.iplot(ff.create_violin(calls, data_header='hour_of_day', group_header="Day"))
There are no very clear patterns here. It is possible Fridays might have crimes shifted slightly more to the evening. In general the time a crime is reported seems fairly consistent throughout the week.
calls.head()
The Offense field appears to contain the specific crime being reported. As nominal data we might want to see a summary constructed by computing counts of each offense type:
calls['OFFENSE'].value_counts(dropna=False).iplot(kind="bar")
Car burglary and misdemeanor theft seem to be the most common crimes with many other types of crimes occurring rarely.
The CVLEGEND filed provides the broad category of crime and is a good mechanism to group potentially similar crimes.
calls['CVLEGEND'].value_counts(dropna=False).iplot(kind="bar")
Notice that when we group by the crime time we see that larceny becomes the dominant category. Larceny is essentially stealing -- taking someone else stuff without force.
View the crime time periods broken down by crime type:
boxes = [(len(df), go.Box(y=df["hour_of_day"], name=i)) for (i, df) in calls.groupby("CVLEGEND")]
py.iplot([r[1] for r in sorted(boxes, key=lambda x:x[0], reverse=True)])
py.iplot(ff.create_distplot([
calls[calls['CVLEGEND'] == "NOISE VIOLATION"]['hour_of_day'],
calls[calls['CVLEGEND'] == "DRUG VIOLATION"]['hour_of_day'],
calls[calls['CVLEGEND'] == "LIQUOR LAW VIOLATION"]['hour_of_day'],
calls[calls['CVLEGEND'] == "FRAUD"]['hour_of_day']
],
group_labels=["Noise Violation", "Drug Violation", "Liquor Violation", "Fraud"],
))
Let's examine the geographic data (latitude and longitude). Recall that we had some missing values. Let's look at the behavior of these missing values according to crime type.
missing_lat_lon = calls[calls[['Lat', 'Lon']].isnull().any(axis=1)]
missing_lat_lon['CVLEGEND'].value_counts().iplot(kind="bar")
There is a clear bias towards drug violations that is not present in the original data. Therefore we should be careful when dropping missing values!
We might further normalize the analysis by the frequency to find which type of crime has the highest proportion of missing values.
(
missing_lat_lon['CVLEGEND'].value_counts() / calls['CVLEGEND'].value_counts()
).sort_values(ascending=False).iplot(kind="bar")
We might also want to examine the time of day in which latitude and longitude are missing:
ts = calls['timestamp']
py.iplot(ff.create_distplot([calls['timestamp'].dt.dayofyear,
missing_lat_lon['timestamp'].dt.dayofyear],
group_labels=["all data", "missing lat-lon"],bin_size=2))
Pretty similar
!pip install --upgrade folium
import folium
print(folium.__version__, "should be at least 0.3.0")
import folium
import folium.plugins # The Folium Javascript Map Library
SF_COORDINATES = (37.87, -122.28)
sf_map = folium.Map(location=SF_COORDINATES, zoom_start=13)
locs = calls[['Lat', 'Lon']].astype('float').dropna().as_matrix()
heatmap = folium.plugins.HeatMap(locs.tolist(), radius = 10)
sf_map.add_child(heatmap)
cluster = folium.plugins.MarkerCluster()
for _, r in calls[['Lat', 'Lon', 'CVLEGEND']].tail(1000).dropna().iterrows():
cluster.add_child(
folium.Marker([float(r["Lat"]), float(r["Lon"])], popup=r['CVLEGEND']))
sf_map = folium.Map(location=SF_COORDINATES, zoom_start=13)
sf_map.add_child(cluster)
sf_map